A cross-cultural investigation of emotion inferences from voice and speech: implications for speech technology

Author

  • Klaus R. Scherer
Abstract

Recent years have seen increasing efforts to improve speech technology tools such as speaker verification, speech recognition, and speech synthesis by taking voice and speech variations due to speaker emotion or attitudes into account. Given the global marketing of speech technology products, it is of vital importance to establish whether the vocal changes produced by emotional and attitudinal factors are universal or vary over cultures and/or languages. The answer to this question determines whether similar algorithms can be used to factor out (or produce) emotional variations across all languages and cultures. This contribution describes the first large-scale effort to obtain empirical data on this issue by studying emotion recognition from voice in nine countries on three different continents.

1. OVERVIEW OF THE ISSUE

The important role of emotion in shaping human voice production has been known since antiquity (see [1] for a review of early research). Yet, due to the relative neglect of emotion-induced voice and speech changes in speech research, little progress has been made in the past 25 years. Only rather recently has the important role of emotion- and attitude-dependent voice and speech variations attracted the interest of phoneticians and engineers working on speech synthesis, speech recognition, and speaker verification [2,3]. Currently, a number of efforts are under way to understand the effects of emotion on voice and speech and to examine possibilities to adapt speech technology algorithms accordingly [4-6].

[Figure 1: Beta weights and multiple R for selected acoustic parameters (mean F0, SD F0, mean energy, duration of voiced segments, Hammarberg index, % energy > 1 kHz, spectral slope > 1 kHz) predicting judgments of fourteen emotions (hot anger, cold anger, panic fear, anxiety, desperation, sadness, elation, happiness, interest, boredom, shame, pride, disgust, contempt) in multiple regressions (adapted from [7]).]

However, one central issue remains largely unexplored: the question of the universality vs. the cultural and/or linguistic relativity of the emotion effects on vocal production. This is obviously of major importance for the development and marketing of speech technology products that have been enhanced to factor out emotional or attitudinal variability in recognition and verification, or to produce appropriate acoustic features in synthesis. Little customization of the respective algorithms would be necessary if emotion effects on the voice were universal, whereas culturally or linguistically relative effects would require special adaptations for specific languages or countries. It is of major practical interest, then, to determine whether there are systematic and differentiated effects of different emotions on voice and speech characteristics, whether these are universal or culturally/linguistically relative, and whether intercultural recognition of emotion cues in the voice is possible.

The apparent predominance of the activation dimension in vocal emotion expression has often led critics to suggest that judges' ability to infer emotion from the voice might be limited to inferences based systematically on perceived activation. In consequence, one might expect that if this single, overriding dimension is controlled for, there should be little evidence for judges' ability to recognize emotion from purely vocal, nonverbal cues. However, evidence from a recent study by Banse & Scherer [7], which used a larger than customary set of acoustic variables as well as more controlled elicitation techniques, points to the existence of both activation and valence cues to emotion in the voice.
Twelve professional actors were asked to portray fourteen different emotions using two standard, meaningless sentences. Their portrayals were based on scenarios provided for each emotion. The actors were asked to use Stanislavski techniques, i.e., to attempt to immerse themselves in the scenario and feel the emotion in the process of portraying it. The resulting speech samples were presented to expert judges to eliminate unsuccessful or unnatural expressions. In the next step, naive judges were used to assess the accuracy with which the different speech samples could be recognized with respect to the expressed emotion. The 224 portrayals that were recognized with a sufficient level of accuracy were chosen for acoustic analysis. The results of these analyses suggest the existence of emotion-specific vocal profiles that differentiate the different emotions not only on a basic arousal or activation dimension but also with respect to qualitative differences.

[Figure 2: Accuracy percentages for emotion classification attained by human judges, jackknifing, and discriminant analysis (adapted from [7]).]

These results explain why studies investigating the ability of listeners to recognize different emotions from a wide variety of standardized vocal stimuli achieve a level of accuracy that largely exceeds what would be expected by chance [8]. The assumption is, of course, that the judges' inference is in fact based on the acoustic cues that determine the vocal profiles for the different emotions. Figure 1, which plots the beta weights from multiple regressions of vocal features on emotion judgments, shows that this is indeed the case: a sizeable proportion of the variance in the judgments can be explained by the acoustic cues that are prominently involved in differentiating vocal emotion profiles.

This interpretation is bolstered by a comparison of the confusion matrices produced by human judges with those obtained from statistical classification techniques operating on the same predictor variables, namely jackknifing and discriminant analysis. Figure 2 shows the respective accuracy percentages for these three types of classification. Not only is the pattern of accuracy coefficients across emotions (the diagonals of the confusion matrices, as shown in Figure 2) highly comparable (with a few interesting exceptions), but the off-diagonal errors are also very similar across judges and classification algorithms (see [7] for details). This provides further evidence for the assumption that judges base their inference of emotion in the voice on acoustic profiles that are characteristic of specific emotions.

Furthermore, many aspects of the profiles identified by Banse & Scherer were predicted by Scherer's Component Process model of emotion [9], which assumes a universal, psychobiological mechanism (push effects) for the vocal expression of emotion, even though it also allows for sociocultural variations (pull effects). This suggests that many of the emotion effects on the voice should be universal.
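To make the acoustic parameters listed in Figure 1 concrete, the following Python sketch computes rough approximations of several of them (mean F0, SD F0, mean energy, proportion of energy above 1 kHz, and the Hammarberg index) from a mono speech recording. This is a minimal illustration, not the extraction procedure used in [7]: the autocorrelation F0 estimator, the 0.3 voicing threshold, the frame sizes, and the file name portrayal.wav are assumptions made for the example, and the Hammarberg index is operationalized here as the dB difference between the spectral energy maxima in the 0-2 kHz and 2-5 kHz bands, one common definition.

```python
# Illustrative sketch: approximating some acoustic parameters named in
# Figure 1 from a mono speech file. Definitions are assumptions, not the
# exact procedures used in [7].
import numpy as np
from scipy.io import wavfile


def frame_signal(x, frame_len, hop):
    """Slice a 1-D signal into overlapping frames."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n)])


def f0_autocorr(frame, sr, fmin=60.0, fmax=400.0):
    """Crude autocorrelation F0 estimate; returns NaN for unvoiced frames."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    if hi >= len(ac) or ac[0] <= 0:
        return np.nan
    lag = lo + np.argmax(ac[lo:hi])
    return sr / lag if ac[lag] / ac[0] > 0.3 else np.nan  # assumed voicing threshold


sr, x = wavfile.read("portrayal.wav")            # hypothetical mono input file
x = x.astype(np.float64) / np.abs(x).max()
frames = frame_signal(x, int(0.04 * sr), int(0.01 * sr))  # 40 ms frames, 10 ms hop

f0 = np.array([f0_autocorr(f, sr) for f in frames])
energy = (frames ** 2).mean(axis=1)              # mean power per frame

spec = np.abs(np.fft.rfft(frames * np.hanning(frames.shape[1]), axis=1)) ** 2
freqs = np.fft.rfftfreq(frames.shape[1], 1.0 / sr)

# Proportion of spectral energy above 1 kHz, averaged over frames.
prop_hf = (spec[:, freqs > 1000].sum(axis=1) / spec.sum(axis=1)).mean()

# Hammarberg index: dB difference between the spectral energy maxima in the
# 0-2 kHz and 2-5 kHz bands (one common operationalization).
lo_band = spec[:, (freqs >= 0) & (freqs < 2000)].max(axis=1)
hi_band = spec[:, (freqs >= 2000) & (freqs < 5000)].max(axis=1)
hammarberg = (10 * np.log10(lo_band / hi_band)).mean()

print(f"mean F0: {np.nanmean(f0):.1f} Hz, SD F0: {np.nanstd(f0):.1f} Hz")
print(f"mean energy: {energy.mean():.4f}, % energy > 1 kHz: {100 * prop_hf:.1f}")
print(f"Hammarberg index: {hammarberg:.1f} dB")
```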
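The Figure 1 analysis itself, regressing emotion judgments on standardized acoustic predictors to obtain beta weights and a multiple R, can be sketched as follows. The data here are random placeholders standing in for the measured portrayals in [7], and the feature names are simply those from Figure 1.

```python
# Illustrative sketch of a Figure 1-style analysis: standardized regression
# ("beta") weights and multiple R for acoustic predictors of emotion judgments.
import numpy as np

rng = np.random.default_rng(0)
n_portrayals, n_features = 224, 7
X = rng.normal(size=(n_portrayals, n_features))   # placeholder acoustic parameters
y = X @ rng.normal(size=n_features) + rng.normal(size=n_portrayals)  # judgments

# Standardize predictors and criterion so the coefficients are beta weights;
# with both sides centered, no intercept term is needed.
Xz = (X - X.mean(axis=0)) / X.std(axis=0)
yz = (y - y.mean()) / y.std()

beta, *_ = np.linalg.lstsq(Xz, yz, rcond=None)
y_hat = Xz @ beta
multiple_r = np.corrcoef(y_hat, yz)[0, 1]         # correlation of fitted with observed

names = ["meanF0", "sdF0", "meanEnergy", "durVoiced",
         "hammarberg", "pctE>1kHz", "slope>1kHz"]
for name, b in zip(names, beta):
    print(f"beta[{name}] = {b:+.2f}")
print(f"multiple R = {multiple_r:.2f}")
```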
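Similarly, the jackknifing procedure compared with human judges in Figure 2 amounts to leave-one-out classification: each portrayal is classified by a discriminant model trained on all remaining portrayals, and the predictions are tallied in a confusion matrix whose diagonal gives the accuracy percentages. A minimal sketch using scikit-learn, again with placeholder data and a reduced emotion set chosen only for brevity:

```python
# Illustrative sketch of a Figure 2-style comparison: leave-one-out
# ("jackknife") linear discriminant classification of portrayals from their
# acoustic parameters, summarized as a confusion matrix.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import LeaveOneOut, cross_val_predict

rng = np.random.default_rng(0)
emotions = ["hot anger", "panic fear", "sadness", "elation", "boredom"]
X = rng.normal(size=(100, 7))                 # placeholder acoustic parameters
y = rng.integers(len(emotions), size=100)     # placeholder emotion labels

# Each portrayal is classified by a model trained on all the others.
pred = cross_val_predict(LinearDiscriminantAnalysis(), X, y, cv=LeaveOneOut())

cm = confusion_matrix(y, pred)                # rows: true, columns: predicted
accuracy = np.trace(cm) / cm.sum()            # diagonal = correct classifications
print(cm)
print(f"overall jackknife accuracy: {accuracy:.2%}")
```

With real acoustic measurements in place of the random X, the off-diagonal cells of cm are what would be compared against the human judges' confusion patterns.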
Unfortunately, there is no study to date that has examined these effects across speakers from different languages and cultures. In the meantime, it may be useful to examine whether judges from different countries can identify vocal expressions from another language/culture. Whereas the perception of emotion from facial expression has been extensively studied cross-culturally, little is known about the ability of judges from different cultures, speaking different languages, to infer emotion from voice and speech encoded in another language by members of other cultures. This contribution summarizes the results of a series of recognition studies conducted by Scherer, Banse, and Wallbott [10] in nine different countries in Europe, the United States, and Asia (using vocal emotion portrayals of content-free sentences produced by professional German actors).
